cBioPortal Interface and cBioPortal Extraction with R

Purpose

At the end of this session, you will be able to:

  1. Extract cBioPortal information through the online interface.
  2. Extract cBioPortal information in R with the cbioportalR package.

Introduction

cBioPortal is a tool for exploring and visualizing molecular data (DNA, RNA, proteins) from multiple studies whose processed data have been made available to the scientific community.
You can see the different studies publicly available here: https://www.cbioportal.org/

cBioPortal is distributed under an open-source license, meaning that the code is available and anyone can install their own local cBioPortal instance.
You can see the list of available cBioPortals (local and public) here: https://installationmap.netlify.app/
existing_instances

Thus, Gustave Roussy has deployed an internal local cBioPortal, meaning that the information is only available to Gustave Roussy employees when connected to the GR_Intern network: https://cbioportal.intra.igr.fr
By default, no study is available. You must make a request for access to the study of interest to the DAC, using this form. Then, as soon as you have the green light from the DAC, you can contact the bioinformatics platform () to obtain your access.
To get the updated list of available studies, send an email to .

For this practical, we will use the public cBioPortal https://www.cbioportal.org/, but the principle is the same with another instance of cBioPortal.

Extract cBioPortal information by the online interface

Connection

To use the interface go to the cBioPortal website: https://www.cbioportal.org/

Main interface

main_interface

On the main page of the site we can see:
- a menu bar at the top,
- a request form at the center,
- an insert on the right where the new features and some example queries are listed.

The request form lists the studies from Gustave Roussy which are accessible via the cBioPortal.
Currently 5 studies are available.
For each study we can see its name, followed by the number of samples. The i logo gives a short description of the study, the book logo links to the publication, and the pie chart logo is a shortcut to the study summary page.

Data Sets tab

data_sets_tab

This tab lists the studies from Gustave Roussy which are accessible via the cBioPortal.
These are the same studies as before.
We can see the name of the studies, the link to the publication, the total number of samples, then the number of samples for each kind of data.
Here we have “Mutations”, “Copy Number Alteration” (CNA), and “RNA-seq” data.

But if you click on the drop-down menu on the right, you can see the other types of data available.
data_sets_tab_col_menu

Study selection

study_selection

From the request form (click on cBioPortal logo to come back to the Main interface), we can either:
- select a study of interest by clicking directly on the pie chart logo of the desired study line.
- select one, several, or all studies by checking them, then click on the “Explore Selected Studies” button on the form.

From the Data Sets tab, we can click on the name of the study of interest.

Of course if there are too many studies available, you can use the keyword search form.
Regardless of how you select your study of interest, you will be redirected to the same page.
This new page contains a general banner at the top, displaying the name of the study (a small download icon to the right of the name allows you to download all the clinical and genomic data available for this study), the number of patients and the number of samples.

The various additional tabs and buttons on this page are detailed below.

Summary tab

summary_tab

The summary tab allows you to summarize the data from the selected study in a few graphs and tables.
This tab is made up of several elements that can be moved, enlarged (bottom right corner) or deleted.
The elements are dynamic: when you hover the mouse over them, additional displays can appear. Likewise, a menu may appear in the upper right corner of each element, offering several options (deleting the element, download, additional information, …).

Tables

summary_tab_tables

Tables have 3 main columns:
- the first column: its name varies depending on the information listed in the table (“Molecular Profile”, “Genes”, “Categories”, etc.).
- the # column (for the number), corresponds to the number of samples with the characteristic of the first column.
- the Freq column (for Frequency), corresponds to the percentage of samples with the characteristic of the first column.

In some specific tables we may have additional columns:
- Mutated Genes table:
  - # Mut: total number of mutations in the gene. It may be higher than the number of samples with a mutation because one sample may have multiple mutations in this gene.
- Structural Variant Genes table:
  - # SV: total number of structural variants of the gene. Likewise, it may be higher than the number of samples with a structural variant because a sample may have multiple structural variants for that gene.
- CNA Genes table:
  - Cytoband: genomic region where the CNA is located (cytoband).
  - CNA: the type of CNA (AMP: amplification; HOMDEL: homozygous deletion).
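
To make the difference between the # Mut column and the # column concrete, here is a minimal sketch in R on a small invented mutation table (the sample IDs and the table layout are assumptions made for the illustration, not data from cBioPortal):

```r
# Toy mutation table: one row per mutation (invented data)
mutations <- data.frame(
  sample_id = c("S1", "S1", "S2", "S3", "S3", "S3"),
  gene      = c("TP53", "TP53", "TP53", "EGFR", "EGFR", "TP53")
)

# "# Mut": total number of mutations per gene
n_mut <- table(mutations$gene)

# "#": number of distinct samples with at least one mutation in the gene
n_samples <- tapply(mutations$sample_id, mutations$gene,
                    function(ids) length(unique(ids)))

n_mut[["TP53"]]      # 4 mutations
n_samples[["TP53"]]  # 3 samples
```

Here sample S1 carries two TP53 mutations, which is why the number of mutations exceeds the number of mutated samples.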

Note that you can filter the tables using the search bar at the bottom of the element, and sort them alphabetically or numerically by clicking on the column names.

Graphs

Several types of graphs are available to represent the data: pie charts, histograms, lineplots, or scatterplots.

Pie charts allow you to represent discrete data such as gender, ethnicity, number of samples per patient, sample type, etc.

Histograms present continuous data such as age at diagnosis, fraction of altered genome, days to sample collection, etc.

Lineplots present time-to-event data as Kaplan-Meier curves, such as overall survival, disease-free survival, etc.

Scatterplots compare 2 continuous variables, such as the number of mutations as a function of the fraction of altered genome. This type of graph also reports correlation scores between the 2 variables (Spearman and Pearson), as well as the associated p-values.
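
If you want to reproduce this kind of correlation outside the interface, the sketch below shows one way to do it in base R. The data are simulated and the variable names are ours; it only illustrates what a Pearson and a Spearman coefficient with their p-values look like, not how cBioPortal computes them internally.

```r
# Simulated data (invented): fraction of genome altered and mutation count
set.seed(42)
fraction_genome_altered <- runif(50)
mutation_count <- round(20 + 100 * fraction_genome_altered + rnorm(50, sd = 15))

# Pearson: linear relationship between the two variables
pearson <- cor.test(fraction_genome_altered, mutation_count,
                    method = "pearson")

# Spearman: rank-based; exact = FALSE avoids a warning when there are ties
spearman <- cor.test(fraction_genome_altered, mutation_count,
                     method = "spearman", exact = FALSE)

# Correlation coefficients and associated p-values
c(pearson = unname(pearson$estimate), p_value = pearson$p.value)
c(spearman = unname(spearman$estimate), p_value = spearman$p.value)
```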

summary_tab_graphs

Note that some graphs display the number of NA values. This is the number of samples for which we do not have the requested information.

More graphs and tables

Not all tables and graphs are displayed by default.
You can click on the Charts button at the top right and browse the different menus (“Clinical”, “Genomics”…) to add graphs, and even try the X to Y option to make scatterplots, boxplots and violinplots with variables of your choice.

summary_tab_more_graphes_and_tables

Sample selection

To select samples, you can click in the tables or in the additional displays that appear when you hover the mouse over the graphs.

summary_tab_sample_selection1

If you already have a list of patients or samples of interest you can use the “Custom Selection” button and put your list there.

summary_tab_sample_selection2

Note that as soon as you have selected according to a criterion, all the tables/graphs are automatically updated to only present the information related to this selection. Of course you can select according to several criteria.

Clinical Data tab

This tab presents the clinical characteristics of the patients/samples. The available columns depend on the studies.

clinical_data_tab

The Charts button in the Summary tab has become Columns and allows you to show or hide columns.

clinical_data_tab_col_menu

There is also a search bar and a download table button.

This tab is up to date with the sample selections made previously.

Other tabs

Other tabs may be available depending on the studies, for example in the study Glioblastoma (TCGA, Cell 2013).

In particular, the Heatmap tab allows you to view the ready-made heatmaps of the public cBioPortal. The menu on the left allows you to select a heatmap of interest among those proposed.

other_tab_heatmap1

We can slightly customize these heatmaps by clicking on them, which redirects us to an editor. In this editor, the left display (“Heat Map Detail”) is a zoom of the complete heatmap (“Heat Map Summary”) presented on the right. The highlighted element on the right heatmap corresponds to what we see on the left heatmap.

other_tab_heatmap2

You can change certain heatmap parameters using the “Parameters” button.

other_tab_heatmap3

The “CN Segments” tab allows you to explore the copy number variations along the chromosomes with an IGV type display. The color represents the number of copies and each line corresponds to a patient. There are also settings buttons to display only a chromosome or a region, change colors, save the plot,…

other_tab_CN_segments

You can zoom by double clicking.

Other tab types exist, including CT Scan, and probably others that I didn’t come across while creating this course.

other_tab_CT_scan1

other_tab_CT_scan2

The “Beta Plots!” tab allows you to make your own comparison graphs of 2 variables, in a similar way to the Charts button of the “Summary” tab, but here we can also compare genomic data. The type of graph depends on the information to be compared.

other_tab_plots_beta

Groups comparisons

Once you have selected your samples of interest, you can put them in a group with the Groups button which will save your selection. So you can make several groups.
Group comparison allows you to compare their clinical and molecular characteristics.

For example, return to the “Summary” tab, select all the “Male” and make them a group then do the same with the “Female”.

groups_comparisons1

groups_comparisons2

groups_comparisons3

groups_comparisons4

You see that our groups are displayed with a color (here pink for “Female” and blue for “Male”).
Then still in the Groups button, select the 2 groups that we have just made, then click on Compare.

groups_comparisons5

A new page opens with new tabs. The number and type of tabs depend on the availability of data from the selected studies.

Overlap tab

The “Overlap” tab uses a Venn diagram to show whether some patients or samples are present in both groups at the same time (which can happen if you have made the selection on patients with several samples, for example).
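
The same check is a simple set operation if you have exported the sample IDs of each group. A minimal sketch in R, with invented IDs:

```r
# Sample IDs of two groups (invented for the illustration)
group_a <- c("P01-S1", "P02-S1", "P03-S1", "P03-S2")
group_b <- c("P03-S2", "P04-S1", "P05-S1")

# Samples present in both groups (the intersection of the Venn diagram)
overlap <- intersect(group_a, group_b)

# Samples specific to each group
only_a <- setdiff(group_a, group_b)
only_b <- setdiff(group_b, group_a)

overlap  # "P03-S2"
```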

groups_comparisons_overlap_tab

Overall survival and Clinical tabs

The “Survival” tab compares the overall survival of the 2 groups with a Kaplan-Meier graph.
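
For readers who want to redo such a comparison on downloaded data, here is a minimal sketch with the survival R package (a recommended package shipped with standard R installations; we assume it is available). The data are invented, and the log-rank test used here is a common choice for comparing two Kaplan-Meier curves, not necessarily the exact procedure applied by the interface.

```r
library(survival)

# Toy survival data (invented): follow-up in months, event = 1 if deceased
toy <- data.frame(
  os_months = c(5, 8, 12, 14, 20, 6, 9, 15, 22, 30),
  os_event  = c(1, 1, 1, 0, 1, 1, 0, 1, 0, 0),
  sex       = rep(c("Male", "Female"), each = 5)
)

# Kaplan-Meier estimate for each group
km_fit <- survfit(Surv(os_months, os_event) ~ sex, data = toy)

# Log-rank test comparing the two survival curves
logrank <- survdiff(Surv(os_months, os_event) ~ sex, data = toy)
logrank_p <- 1 - pchisq(logrank$chisq, df = length(logrank$n) - 1)

summary(km_fit)  # survival probabilities over time, per group
logrank_p        # p-value of the log-rank test
```

Calling plot(km_fit) draws the corresponding Kaplan-Meier curves.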

groups_comparisons_overall_survival_and_clinical_tabs1

The clinical tab compares the clinical data between the 2 groups with appropriate statistical tests and presents graphs. Results are sorted by significance and significant results are in bold.
As usual, you can select the columns to display, download the results table and graphs, search by keyword, and change the graph display.

groups_comparisons_overall_survival_and_clinical_tabs2

Note: whether for the Survival or Clinical tab, there is a banner indicating that it is necessary to “Interpret all results with caution, as they can be confounded by many different variables that are not controlled for in these analyses. Consider consulting a statistician.”. So the statistical tests carried out here can give us a first insight but it is better to discuss with a specialist before making any conclusions.

Genomic Alteration tab

The “Genomic Alteration” tab allows you to compare mutations between the 2 groups, in particular the frequency of alteration overall or for certain genes. As usual, you have additional displays if you hover your mouse over the graphs, you can download the graphs and change which genes are displayed.

groups_comparisons_genomic_alteration_tab

At the bottom you have a comparison table for each alteration. If you hover over the column names, explanations are displayed, and you have the usual selection and download options.

Mutations Beta! tab

The “Beta Mutations!” tab allows you to compare the mutations between the 2 groups, by representing them along the protein domains. Mutations are represented above or below the protein axis depending on the group of patients to which they belong.

groups_comparisons_mutations_beta_tab

By default it offers you the proteins with the highest mutation frequency of their gene, but you can choose your protein/gene of interest. You can also display additional annotations by clicking on the “Add annotation Tracks” button (they are displayed at the bottom of the graph). You can view the legend by clicking the “Legend” button at the top right.
Do not hesitate to hover your mouse over the graphic elements, the displays are dynamic.
Similar to the “Genomic Alteration” tab, at the bottom you have a table which summarizes the protein changes and the group in which they are enriched, with a score and significance. Note the presence of the “Annotation” column which gives you additional information from databases (OncoKB, CIViC,…).

mRNA, Protein, DNA Methylation tabs and other data

The “mRNA” tab allows you to carry out a differential expression analysis between our 2 groups. The graph displayed by default is a volcanoplot. There is also a table at the bottom with the results of the differential analysis. It is sorted by significance (q-Value) and significant genes are in bold.
If we click on a gene in this table, a new graph appears with the expression levels in boxplot format.

groups_comparisons_mRNA_protein_and_DNA_methylation_tabs

Note that if you hover over the “p-Value” column name, it indicates that the test used is a Student’s t-test. This is a very good example of why you should seek the advice of a statistician before interpreting the results: a t-test requires a large number of samples to compare, and other statistical tests are more suitable.
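
To see how the choice of test can matter, the sketch below compares a Student's t-test with a Wilcoxon rank-sum test on simulated, skewed expression values for two small groups. The Wilcoxon test is only one possible alternative, used here as an illustration; the data are invented and the right test for your study should be discussed with a statistician.

```r
# Simulated skewed expression values for two small groups (invented data)
set.seed(1)
expr_group_a <- rlnorm(8, meanlog = 2, sdlog = 1)
expr_group_b <- rlnorm(8, meanlog = 3, sdlog = 1)

# Student's t-test (Welch variant by default in R): compares the means
t_res <- t.test(expr_group_a, expr_group_b)

# Wilcoxon rank-sum test: rank-based, no normality assumption
w_res <- wilcox.test(expr_group_a, expr_group_b)

# The two p-values can differ noticeably on small, skewed samples
c(t_test = t_res$p.value, wilcoxon = w_res$p.value)
```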

The “Protein” and “DNA Methylation” tabs are identical to the “mRNA” tab but for proteins or methylation.

Other data can be available like Arm-level CNA or Genetic Ancestry, depending on the study.

Search by genes

Instead of selecting the samples/patients, then doing the analyses/tables/graphs, then searching for your genes of interest in each analysis/table/graph, you can indicate your genes from the start and get results already filtered.

To do this, we return to the very first page of the cBioPortal (by clicking on the logo at the top left of the page), we select the study (or studies) of interest, then we click on “Query By Gene”.

search_by_genes1

The rest of the form unfolds and you can choose the molecular data on which you want to work; then the patients/samples of interest, for example we keep all the samples or only those which have CNA data, or we can put our own list by choosing “User-defined Case List” and putting our list of IDs. Then, we write our genes of interest (separated by a comma if you want to put several), or we can also choose pre-defined lists of genes by clicking on the drop-down menu.

Then we click on “Submit Query”.

search_by_genes2

If you have a lot of samples, it may take a few minutes before the following page appears with multiple tabs.
The number and type of tabs depend on the availability of data from the selected studies.

Oncoprint tab

The first result obtained is an oncoprint.
This representation is widely used in publications analyzing patient cohorts because it allows a quick overview of the distribution of genetic alterations in the cohort.
In this graph, each vertical line corresponds to a patient. Then, horizontally, we have their clinical data (different information depending on the study and modifiable with the “Tracks” drop-down menu), and the alteration information according to our genes of interest, with the percentage of patients with an alteration in these genes and the type of alteration.
At the top of this table we have a banner with buttons for sorting, filtering and zooming in/out.

search_by_genes_oncoprint

Cancer Type Summary tab

This tab presents the same results but in the form of a cumulative histogram by type of cancer and by gene requested. You can choose the gene and cancer type level to plot with the options at the top of the graph.

search_by_genes_cancer_type_summary_tab

Mutual Exclusivity tab

In this tab you can see if the genes in your list are linked. In other words, when one gene is mutated, is the other gene also often mutated (“Co-occurrence”) or, on the contrary, almost never mutated (“Mutual exclusivity”)?
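
The idea can be sketched in R with a 2x2 contingency table and a Fisher's exact test, which is a common way to test such an association (we use it here as an illustration on invented data, without claiming it reproduces the exact output of the tab):

```r
# Toy alteration status for two genes across 12 samples (invented data)
gene_a <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,
            FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
gene_b <- c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
            TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# 2x2 table: altered / not altered for each gene
alteration_table <- table(gene_a = gene_a, gene_b = gene_b)

# Fisher's exact test of association between the two genes
res <- fisher.test(alteration_table)

alteration_table
res$estimate  # odds ratio: < 1 suggests mutual exclusivity, > 1 co-occurrence
res$p.value
```

In this toy example no sample carries both alterations, so the odds ratio is below 1, which points toward mutual exclusivity.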

search_by_genes_mutual_exclusivity_tab

Plots tab

This tab allows you to interactively generate graphs combining different types of data.

For example, we can look at the expression of TP53 according to the type of Copy Number alteration of TP53, by coloring the TP53 alteration type.

search_by_genes_plots_tab

Or for example, we could look at the variation in the protein level of our gene depending on its gene expression level (RNA).

Mutations tab

This tab corresponds to the tab named “Beta Mutations!” obtained when comparing several groups (except that here we have no comparison). We have the graph of the distribution of mutations along the proteins of our genes of interest, with the corresponding table.

search_by_genes_mutations_tab

Co-Expression tab

The “Co-Expression” tab allows us to know if other genes are co-expressed with our genes of interest, in other words, genes whose expression evolves in a similar way across patients.
The results are in tabular form with all genes tested, the most significant at the top. We can also plot a scatterplot of the expression of the 2 genes and a correlation score is calculated. If the mutation status is available the points are colored according to this status.

Example with the Glioblastoma Multiforme study (TCGA GDC, 2025) on the public cBioPortal:

search_by_genes_co_expression_tab

Comparison/Survival tab

This tab itself has several tabs, which are the same as when we compared the 2 groups of samples previously (Overlap, Survival, Clinical, Genomic Alterations, mRNA, Protein, DNA Methylation,…), but here to compare the samples with at least one alteration of our genes of interest against the samples without alteration.
You can also change the groups to compare:
- samples with at least one alteration among all our genes of interest,
- samples without any alteration among our genes of interest,
- samples with at least one alteration in a chosen gene.

search_by_genes_comparison_survival_tab

CN Segments tab

The graphs are the same as for the “CN Segments” tab seen previously, but here we have buttons for each of our genes to go directly to their genomic regions more easily.

search_by_genes_CN_segments_tab

Pathways tab

This tab allows us to display the signaling pathways of our genes of interest.
There are 2 pathway databases available and each has its own viewer:

  • PathwayMapper shows pathways from over fifty cancer related pathways and provides a collaborative web-based editor for creating new ones.
    It displays network signaling pathways as well as the frequency of alteration of genes present in these pathways (if the information is available).
    On the right is the table of results with all the signaling pathways identified (sometimes you can have several pathways for a single gene).
    >Please note if the “Show TCGA PanCancer Atlas pathways only” option is checked then only the pathways presented in the TCGA PanCancer Atlas pathways will be shown (and not all the pathways in PathwayMapper).
    >Note that you can move the pathway bricks to arrange them wherever you want.

search_by_genes_pathways_tab1

  • NDEx shows 1,352 pathways by aggregating several other databases: NCI-PID, Signor, WikiPathways, CPTAC, CCMI and NeST. Here we have the list of signaling pathways on the left and their display on the right.
    The display and legend differ from one pathway to another because they depend on the database used. But our genes of interest are always boxed in pink.
    You can click on the name of the pathway (in the title displayed in blue) to get information on it, as well as on the genes (nodes) and the links between genes (edges). But the graphs are also dynamic so we can get this information by clicking on the genes or on the links between the genes directly.

search_by_genes_pathways_tab2

Download tab

On the public cBioPortal only, this tab allows you to download all the clinical and genomic data from the samples selected according to our genes of interest.

Make graphs on your own data?

Another interesting feature of cBioPortal is the ability to visualize your own data. You can make:
- an oncoprint,
- a MutationMapper plot.

To do this, click on the “Visualize Your Data” tab (at the very top of the page), then on “OncoPrinter” or “MutationMapper”.

make_graphs_on_your_own_data

Please note, all data must be anonymized before upload.

Oncoprint

This allows you to make the same oncoprint as seen previously. You can copy and paste your tables directly into the corresponding areas or load them from a file. To get an idea of the format you can click on “Load example data”, and for an explanation of the format, on “View data format”. You can provide genomic, clinical and/or heatmap information. Once the information has been entered, you can optionally choose the order in which the genes or samples appear on the graph, then click on “Submit”.

make_graphs_on_your_own_data_oncoprint1

The oncoprint appears (with genomic information, then clinical, then heatmap) as well as a Mutual Exclusivity analysis.

make_graphs_on_your_own_data_oncoprint2

Mutationmapper

This allows you to create a graph representing mutations along protein domains.
You choose your reference genome (the one used for your analysis), then enter your analysis data.
To see the expected format, you can click on one of the examples given in “Load example data” and/or for an explanation of the format, click on “Data format”.

make_graphs_on_your_own_data_mutationmapper1

Then click on “Visualize”.

make_graphs_on_your_own_data_mutationmapper2

Save your session

To store your virtual studies and groups you can sign in with your Google or Microsoft account on public cBioPortal (similarly with your Gustave Roussy account on the Gustave Roussy cBioPortal). This will allow you to access your studies and groups from any computer, and cBioPortal will also remember your study view charts preferences for each study (i.e. order of the charts, type of charts and visibility).
Login is optional on the public cBioPortal and not required to access any of the other features of cBioPortal.

Also, you can share your patient selection by creating a web link to your selection. Click on the “Save/Share Virtual Study” button, give a name to your selection, then click on “Save” or “Share”.

save_your_session1

save_your_session2

save_your_session3

You can save the link to your virtual study, to share it with your colleagues.

Also, if you come back to the very first page of the cBioPortal (by clicking on the logo at the top left of the page), you can see your new virtual study.

save_your_session4

Extract cBioPortal information by R

But what about the reproducibility of your research on this website? Do you remember each button you clicked, especially for study selection and sample/patient selection?
The easiest way is to make an R script.

There are 2 R packages that allow you to retrieve data from cBioPortal: cbioportalR and cBioPortalData. Unfortunately, neither of these 2 packages allows you to recover all the data in an easily analyzable format, nor to create graphs (the graphs will have to be created by yourself).

cbioportalR retrieves data in an easily filterable data.frame format, but only clinical data, mutations, copy numbers (with segments) and structural variants. In other words, it does not retrieve gene expression, methylation, protein levels, etc. In addition, the data frames it provides are not linked together, so if we filter the mutation table to keep only certain patients (for example), we must filter the clinical data on the same set of patients ourselves.
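
Keeping two unlinked tables in sync is usually a one-liner with dplyr. The sketch below uses two invented data frames; the column names (patientId, hugoGeneSymbol) are assumptions chosen for the illustration, so check the names in your own tables.

```r
library(dplyr)

# Toy tables (invented): clinical data and mutations are separate data frames
clinical <- data.frame(
  patientId = c("P1", "P2", "P3", "P4"),
  sex       = c("Male", "Female", "Female", "Male")
)
mutations <- data.frame(
  patientId      = c("P1", "P1", "P2", "P4"),
  hugoGeneSymbol = c("TP53", "EGFR", "TP53", "PTEN")
)

# Step 1: filter the mutation table (here, keep TP53 mutations only)
tp53_mutations <- filter(mutations, hugoGeneSymbol == "TP53")

# Step 2: keep only the clinical rows of patients still present in the
# filtered mutation table
tp53_clinical <- semi_join(clinical, tp53_mutations, by = "patientId")

tp53_clinical$patientId  # "P1" "P2"
```

semi_join() keeps the rows of the first table that have a match in the second one, without adding any column, which is exactly what is needed to propagate a filter from one table to the other.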

cBioPortalData retrieves data in a complex format called MultiAssayExperiment, which is an assembly of several other objects of type SummarizedExperiment and RaggedExperiment. This assembly makes it possible to link clinical data to experimental data, and to filter everything as a single block (so we can filter clinical and experimental data at the same time). In my experience, only structural variant data is not retrieved with this package; however, it heavily filters the available information. For example, for mutations, it only loads the position of the mutation (for each mutated gene for each sample), but we no longer have the information concerning the mutated nucleotide (A? T? C? G?), the number of mutated and normal reads (so it is impossible to compute the VAF), nor any annotation (missense? impact on the protein? already known in the annotation databases (COSMIC…)?).
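
As a reminder of why the read counts matter, here is a minimal sketch of a VAF (variant allele frequency) computation on an invented mutation table. The column names tumorAltCount and tumorRefCount are assumptions for the illustration; check how the read-count columns are named in the table you actually retrieve.

```r
# Toy mutation table with read counts (invented data, assumed column names)
mutations <- data.frame(
  sampleId       = c("S1", "S2", "S3"),
  hugoGeneSymbol = c("TP53", "EGFR", "PTEN"),
  tumorAltCount  = c(30, 5, 120),  # reads supporting the mutated allele
  tumorRefCount  = c(70, 95, 80)   # reads supporting the reference allele
)

# VAF = mutated reads / total reads covering the position
mutations$vaf <- mutations$tumorAltCount /
  (mutations$tumorAltCount + mutations$tumorRefCount)

mutations[, c("sampleId", "hugoGeneSymbol", "vaf")]
```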

Welcome to the real deep world of bioinformatics. Everything is not always obvious, but we always end up finding a solution with a little agility and imagination.

The shortcomings of these two packages are probably due to the file formats, which can differ from one study to another. For example for RNA, there can be several versions of the expression tables: raw counts, log2-transformed, z-scores, and either by gene or by transcript.
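
To make these versions concrete, the sketch below derives a log2 table and a z-score table from a small invented count matrix. The log2(count + 1) transformation and the per-gene z-score across all samples are common conventions that we assume here; a given study may use a different reference population or normalization.

```r
# Toy raw count matrix (invented): genes in rows, samples in columns
counts <- matrix(
  c(10, 200, 0,
    35, 180, 5,
    50, 220, 2),
  nrow = 3,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("S1", "S2", "S3"))
)

# log2 version: the +1 avoids taking the log of zero
log2_expr <- log2(counts + 1)

# z-score version: for each gene, center and scale across samples
# (scale() works on columns, hence the double transposition)
z_expr <- t(scale(t(log2_expr)))

round(z_expr, 2)
```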
Limitations are also attributable to the API (application programming interface) created by the cBioPortal team. An API is an interface offered by a website that allows information to be retrieved programmatically (for example from a script or the command line) instead of through the web pages. So these packages query the website via the API, but if the API is limited, the packages will be limited too.

For this course, we suggest using the cbioportalR package as a basis, then using some in-house functions to recover the missing information. The results will be in a dataframe format, easy to filter with functions dedicated to dataframes.

I strongly advise you to have understood the courses on Tables Manipulations and the one on Making graphs with ggplot2. We will consider their content as acquired for the rest of this course.

Setup environment and token

We load cbioportalR R package to get data, and dplyr/tidyr R packages to manipulate data.

# Installation (only needed once)
install.packages("cbioportalR")
# Load libraries
library(cbioportalR)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

Before accessing data you will need to connect to a cBioPortal database and set your base URL for the R session.

For the Gustave Roussy cBioPortal

If you want to use the Gustave Roussy cBioPortal, you need to connect to the GR_Intern network and establish the connection between your R session and the internal cBioPortal. For this last point, you need to add a token to your ~/.Renviron file to authorize access:

  1. Go on the cBioPortal interface (https://cbioportal.intra.igr.fr), then retrieve your token (top right):

setup_environment_and_token_for_GR_cBioportal

  2. Modify your R environment by editing the ~/.Renviron file to add a new variable like:
    CBIOPORTAL_TOKEN = 'YOUR_TOKEN'
#To open the file ~/.Renviron
usethis::edit_r_environ()
#Restart R afterwards so that the new variable is taken into account
  3. Then connect to “cbioportal.intra.igr.fr” with the set_cbioportal_db() function:
base_url <- "cbioportal.intra.igr.fr"
set_cbioportal_db(base_url)

The token is valid for 30 days. After that, you will need to regenerate it.
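
Before connecting, it can be useful to check that R actually sees the token. A minimal sketch in base R, assuming the variable is named CBIOPORTAL_TOKEN as in the step above (the token value itself is deliberately never printed):

```r
# Reload ~/.Renviron without restarting R (only if the file exists)
renviron_path <- path.expand("~/.Renviron")
if (file.exists(renviron_path)) {
  readRenviron(renviron_path)
}

# Read the variable; Sys.getenv() returns "" when it is not defined
token <- Sys.getenv("CBIOPORTAL_TOKEN")

if (nzchar(token)) {
  message("CBIOPORTAL_TOKEN is set (", nchar(token), " characters).")
} else {
  message("CBIOPORTAL_TOKEN is not set: check your ~/.Renviron file.")
}
```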

For the public cBioPortal

If you want to use the public cBioPortal database instance (https://www.cbioportal.org), you do not need a token: just connect to the “https://www.cbioportal.org” base URL with the set_cbioportal_db() function:

base_url <- "https://www.cbioportal.org"
set_cbioportal_db(base_url)
## v You are successfully connected!
## v base_url for this R session is now set to "www.cbioportal.org/api"

In the rest of the course we will use the public cBioPortal database instance, but everything works similarly for the Gustave Roussy cBioPortal.

Identifying available studies

Now that we are successfully connected, we may want to view available studies for our chosen database to find the correct study_id corresponding to the data we want to pull. You can view all studies available in your database with the following:

studies <- available_studies()
head(studies)
## # A tibble: 6 x 14
##   studyId  name  description publicStudy pmid  citation groups status importDate
##   <chr>    <chr> <chr>       <lgl>       <chr> <chr>    <chr>   <int> <chr>     
## 1 cesc_tc~ Cerv~ "Cervical ~ TRUE        2962~ TCGA, C~ "PUBL~      0 2024-12-2~
## 2 sarc_tc~ Sarc~ "Sarcoma T~ TRUE        2962~ TCGA, C~ "PUBL~      0 2024-12-2~
## 3 crc_ori~ Colo~ "Combined ~ TRUE        3938~ Wala, J~ ""          0 2025-06-3~
## 4 crc_ori~ Colo~ "Combined ~ TRUE        3938~ Wala, J~ ""          0 2025-06-3~
## 5 crc_ori~ Colo~ "Combined ~ TRUE        3938~ Wala, J~ ""          0 2025-06-3~
## 6 lusc_tc~ Lung~ "Lung Squa~ TRUE        2962~ TCGA, C~ "PUBL~      0 2024-12-2~
## # i 5 more variables: allSampleCount <int>, readPermission <lgl>,
## #   resourceCounts <list>, cancerTypeId <chr>, referenceGenome <chr>

We get several pieces of information such as the name of the studies, their description, their publication, the date of import, the number of samples,…
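
These columns make it easy to look up a study_id by keyword. The sketch below works on a small hand-made table (toy_studies, built from three public study names) so that it runs without a connection; with real data, you would apply the same filter to the studies object.

```r
# Toy table standing in for the result of available_studies()
toy_studies <- data.frame(
  studyId = c("stad_tcga", "paad_icgc", "egc_msk_tp53_ccr_2022"),
  name    = c("Stomach Adenocarcinoma (TCGA, Firehose Legacy)",
              "Pancreatic Adenocarcinoma (ICGC, Nature 2012)",
              "Esophagogastric Cancer (MSK, Clin Cancer Res 2022)")
)

# Case-insensitive keyword search in the study names
keyword <- "adenocarcinoma"
hits <- toy_studies[grepl(keyword, toy_studies$name, ignore.case = TRUE), ]

hits$studyId  # "stad_tcga" "paad_icgc"
```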

The number of available studies:

nrow(studies)
## [1] 50

Hmm, only 50 studies? That's not a lot! Where are the ~500 studies presented on the website?
This is a limitation of the API: it returns only the last 50 studies by default.
Unfortunately the cbioportalR package does not offer a solution, so we will have to code that by hand. I’m giving you the code ready-made; there is no need to understand it in detail, as that’s not the subject of this course.

#load packages
library(httr) #to send HTTP requests to the API
library(jsonlite) #to convert JSON responses to R objects

#cBioPortal API base address
studies_url <- paste0(base_url, "/api/studies")

#send a GET request to the API
res <- GET(
    studies_url,
    query = list(
      pageSize = 1000, # 1000 studies requested (there are ~500 studies, so that's sufficient for this database)
      pageNumber = 0   # the first page
    ),
    accept_json()     # request JSON format
  )
#format into data.frame
all_studies <- fromJSON(content(res, "text", encoding = "UTF-8"))
#print the first lines of the data.frame
head(all_studies)
##                                                 name
## 1      Metastatic Solid Cancers (UMich, Nature 2017)
## 2     Stomach Adenocarcinoma (TCGA, Firehose Legacy)
## 3 Colorectal Cancer (CAS Shanghai, Cancer Cell 2020)
## 4                 MSK-IMPACT Heme Tumors (MSK, 2022)
## 5      Pancreatic Adenocarcinoma (ICGC, Nature 2012)
## 6 Esophagogastric Cancer (MSK, Clin Cancer Res 2022)
##                                                                                                                                                                                    description
## 1                                         Whole-exome and -transcriptome sequencing of 500 adult patients with metastatic solid tumor/primary normal pairs of diverse lineage and biopsy site.
## 2 TCGA Stomach Adenocarcinoma. Source data from <A HREF="http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/STAD/20160128/">GDAC Firehose</A>. Previously known as TCGA Provisional.
## 3                                  Whole-exome sequencing of 146 colorectal tumor/normal pairs from a chinese cohort, covering 70 metastatic and 76 non-metastatic colorectal cancer patients.
## 4                                                                              Targeted sequencing of 2383 myeloid and lymphoid neoplasms and their matched normals via MSK-IMPACT Heme panel.
## 5                                                                                                                   Whole-exome sequencing of 99 pancreatic samples and their matched normals.
## 6                                                                                                       Targeted sequencing of 237 esophagogastric tumor/normal pairs via MSK-IMPACT platform.
##   publicStudy     pmid                          citation groups status
## 1        TRUE 28783718       Robinson et al. Nature 2017             0
## 2        TRUE     <NA>                              <NA> PUBLIC      0
## 3        TRUE 32888432        Li et al. Cancer Cell 2020 PUBLIC      0
## 4        TRUE     <NA>                              <NA>             0
## 5        TRUE 23103869        Biankin et al. Nature 2012             0
## 6        TRUE 35377946 Smita et al. Clin Cancer Res 2022             0
##            importDate allSampleCount readPermission resourceCounts
## 1 2024-12-09 10:46:46              1           TRUE           NULL
## 2 2025-06-17 12:13:18              1           TRUE           NULL
## 3 2024-12-20 11:02:57              1           TRUE           NULL
## 4 2024-12-16 10:56:34              1           TRUE           NULL
## 5 2025-06-11 22:02:02              1           TRUE           NULL
## 6 2024-12-04 18:47:18              1           TRUE           NULL
##                             studyId cancerTypeId referenceGenome
## 1 metastatic_solid_tumors_mich_2017        mixed            hg19
## 2                         stad_tcga         stad            hg19
## 3                coadread_cass_2020     coadread            hg19
## 4              heme_msk_impact_2022        mixed            hg19
## 5                         paad_icgc         paad            hg19
## 6             egc_msk_tp53_ccr_2022          egc            hg19
#print the number of rows of the data.frame
nrow(all_studies)
## [1] 519

Great! Now there are over 500 studies available!
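
The request above works because a single page of 1000 entries is larger than the whole database. If an instance ever held more studies than one page can return, the pages would have to be fetched one after the other. Here is a minimal sketch of such a loop. It runs offline on made-up data: fetch_all_pages, fake_fetch and fake_studies are hypothetical names used for illustration only, and in practice the fetch_page argument would be a function wrapping the GET() call shown above.

```r
# Generic pagination loop. `fetch_page` is a hypothetical function that takes
# (page_number, page_size) and returns a data.frame (zero rows when exhausted).
fetch_all_pages <- function(fetch_page, page_size) {
  pages <- list()
  page_number <- 0                       # pages are numbered from 0
  repeat {
    page <- fetch_page(page_number, page_size)
    if (nrow(page) == 0) break           # nothing left to fetch
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < page_size) break    # last, partially filled page
    page_number <- page_number + 1
  }
  do.call(rbind, pages)                  # stack all pages into one data.frame
}

# Offline stand-in for the API: 5 made-up studies served page by page
fake_studies <- data.frame(studyId = paste0("study_", 1:5),
                           stringsAsFactors = FALSE)
fake_fetch <- function(page_number, page_size) {
  first <- page_number * page_size + 1
  if (first > nrow(fake_studies)) return(fake_studies[0, , drop = FALSE])
  last <- min(first + page_size - 1, nrow(fake_studies))
  fake_studies[first:last, , drop = FALSE]
}

# Fetch everything, 2 studies per page
all_fake <- fetch_all_pages(fake_fetch, page_size = 2)
nrow(all_fake)
```

The loop stops either on an empty page or on a page that is not full, so it never requests more pages than necessary.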

We can plot the 20 cancer types (cancerTypeId) with the largest number of studies:

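library(dplyr)   #for %>%, count() and slice_max()
library(ggplot2)

#get the 20 cancer types with the most studies (ties for the last place are kept)
top_cancers <- all_studies %>%
  count(cancerTypeId) %>%
  slice_max(n, n = 20)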

#plot
ggplot(top_cancers, aes(x = reorder(cancerTypeId, n), y = n)) +
  geom_col(fill = "steelblue") +  # bar heights are taken directly from n
  coord_flip() +  # flip axes for readability
  labs(
    title = "Number of Studies per Cancer Type",
    x = "Cancer Type",
    y = "Number of Studies"
  ) +
  theme_classic()

Choose your study of interest

If we want to search for all studies related to glioblastoma (i.e. studies that have the term “glioblastoma” in their name):

as.data.frame(subset(all_studies, grepl("glioblastoma", all_studies$name, ignore.case = TRUE)))
##                                                name
## 113                 Glioblastoma (CPTAC, Cell 2021)
## 185 Glioblastoma Multiforme (TCGA, PanCancer Atlas)
## 265        Glioblastoma Multiforme (TCGA GDC, 2025)
## 266                  Glioblastoma (TCGA, Cell 2013)
## 274 Glioblastoma Multiforme (TCGA, Firehose Legacy)
## 420          Glioblastoma (Columbia, Nat Med. 2019)
## 500                Glioblastoma (TCGA, Nature 2008)
##                                                                                                                                                                                                                                                                description
## 113                                                                                                                        Proteogenomic and metabolomic characterization of human glioblastoma. Whole genome or whole exome sequencing of 99 samples. Generated by CPTAC.
## 185 Glioblastoma Multiforme TCGA PanCancer data. The original data is <a href="https://gdc.cancer.gov/about-data/publications/pancanatlas">here</a>. The publications are <a href="https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html">here</a>.
## 265                                                                TCGA Glioblastoma Multiforme. Source data from <A HREF="https://gdc.cancer.gov">NCI GDC</A> and generated in Aug 2025 using <A HREF="https://cda.readthedocs.io/en/latest/">Cancer Data Aggregator</A>.
## 266                                                                                                                     Whole-exome and/or whole-genome sequencing of 291 of the 577 glioblastoma tumor/normal pairs. The Cancer Genome Atlas (TCGA) Glioblastoma Project.
## 274                                                                           TCGA Glioblastoma Multiforme. Source data from <A HREF="http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/GBM/20160128/">GDAC Firehose</A>. Previously known as TCGA Provisional.
## 420                                                                                                                                                                                    Whole-exome sequencing of 32 out of 42 glioblastomas patients with matched normals.
## 500                                                                                                                  Targeted sequencing in 91 of the 206 primary glioblastoma tumors (143 with matched normals) from the Cancer Genome Atlas (TCGA) Glioblastoma Project.
##     publicStudy
## 113        TRUE
## 185        TRUE
## 265        TRUE
## 266        TRUE
## 274        TRUE
## 420        TRUE
## 500        TRUE
##                                                                                                            pmid
## 113                                                                                                    33577785
## 185 29625048,29596782,29622463,29617662,29625055,29625050,29617662,30643250,32214244,29625049,29850653,36334560
## 265                                                                                                        <NA>
## 266                                                                                                    24120142
## 274                                                                                                        <NA>
## 420                                                                                                    30742119
## 500                                                                                                    18772890
##                     citation        groups status          importDate
## 113    Wang et al. Cell 2021                    0 2025-10-21 16:28:33
## 185          TCGA, Cell 2018 PUBLIC;PANCAN      0 2025-10-21 16:06:43
## 265                     <NA>        PUBLIC      0 2025-10-21 16:33:34
## 266          TCGA, Cell 2013                    0 2025-10-21 15:41:40
## 274                     <NA>        PUBLIC      0 2025-10-21 15:32:47
## 420 Zhao et al. Nat Med 2019                    0 2025-10-21 16:14:34
## 500        TCGA, Nature 2008        PUBLIC      0 2025-10-21 15:38:53
##     allSampleCount readPermission
## 113              1           TRUE
## 185              1           TRUE
## 265              1           TRUE
## 266              1           TRUE
## 274              1           TRUE
## 420              1           TRUE
## 500              1           TRUE
##                                                                             resourceCounts
## 113                                                                                   NULL
## 185 IDC_OHIF_V2, CT Scan, CT Scan, PATIENT, 1, TRUE, 592, 585, gbm_tcga_pan_can_atlas_2018
## 265                                                                                   NULL
## 266                                                                                   NULL
## 274                                                                                   NULL
## 420                                                                                   NULL
## 500                                                                                   NULL
##                         studyId cancerTypeId referenceGenome
## 113              gbm_cptac_2021         difg            hg19
## 185 gbm_tcga_pan_can_atlas_2018         difg            hg19
## 265                gbm_tcga_gdc         difg            hg38
## 266            gbm_tcga_pub2013         difg            hg19
## 274                    gbm_tcga         difg            hg19
## 420           gbm_columbia_2019         difg            hg19
## 500                gbm_tcga_pub         difg            hg19

Note: grepl is a function that searches for a pattern (a word or, more generally, a regular expression) in each element of a character vector. It returns a logical vector (TRUE/FALSE), which is passed to the subset function to select the matching rows of the table. We also ignore the case, because pattern matching is case-sensitive by default (“Glioblastoma” is different from “glioblastoma”).
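
As a small illustration of this note, here is a sketch on a hand-made vector (study_names is made up for the example and is not taken from the API):

```r
# Three made-up study names
study_names <- c(
  "Glioblastoma (TCGA, Cell 2013)",
  "Breast Cancer (toy example)",
  "Pediatric glioblastoma (toy example)"
)

# Case-sensitive search (the default): the capitalised name is missed
case_sensitive <- grepl("glioblastoma", study_names)
case_sensitive
## [1] FALSE FALSE  TRUE

# Case-insensitive search: both glioblastoma studies are found
case_insensitive <- grepl("glioblastoma", study_names, ignore.case = TRUE)
case_insensitive
## [1]  TRUE FALSE  TRUE
```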

For this example, we choose the study named “Glioblastoma Multiforme (TCGA, PanCancer Atlas)”, whose identifier is “gbm_tcga_pan_can_atlas_2018”:

study_id <- "gbm_tcga_pan_can_atlas_2018"

To get more information on our study, we can do the following:

study_info <- get_study_info(study_id) %>% t()
study_info
##                              [,1]                                                                                                                                                                                                                                                                        
## name                         "Glioblastoma Multiforme (TCGA, PanCancer Atlas)"                                                                                                                                                                                                                           
## description                  "Glioblastoma Multiforme TCGA PanCancer data. The original data is <a href=\"https://gdc.cancer.gov/about-data/publications/pancanatlas\">here</a>. The publications are <a href=\"https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html\">here</a>."
## publicStudy                  "TRUE"                                                                                                                                                                                                                                                                      
## pmid                         "29625048,29596782,29622463,29617662,29625055,29625050,29617662,30643250,32214244,29625049,29850653,36334560"                                                                                                                                                               
## citation                     "TCGA, Cell 2018"                                                                                                                                                                                                                                                           
## groups                       "PUBLIC;PANCAN"                                                                                                                                                                                                                                                             
## status                       "0"                                                                                                                                                                                                                                                                         
## importDate                   "2025-10-21 16:06:43"                                                                                                                                                                                                                                                       
## allSampleCount               "1"                                                                                                                                                                                                                                                                         
## sequencedSampleCount         "397"                                                                                                                                                                                                                                                                       
## cnaSampleCount               "575"                                                                                                                                                                                                                                                                       
## mrnaRnaSeqSampleCount        "0"                                                                                                                                                                                                                                                                         
## mrnaRnaSeqV2SampleCount      "160"                                                                                                                                                                                                                                                                       
## mrnaMicroarraySampleCount    "0"                                                                                                                                                                                                                                                                         
## miRnaSampleCount             "0"                                                                                                                                                                                                                                                                         
## methylationHm27SampleCount   "0"                                                                                                                                                                                                                                                                         
## rppaSampleCount              "231"                                                                                                                                                                                                                                                                       
## massSpectrometrySampleCount  "0"                                                                                                                                                                                                                                                                         
## completeSampleCount          "145"                                                                                                                                                                                                                                                                       
## readPermission               "TRUE"                                                                                                                                                                                                                                                                      
## treatmentCount               "448"                                                                                                                                                                                                                                                                       
## structuralVariantCount       "123"                                                                                                                                                                                                                                                                       
## resourceCounts.resourceId    "IDC_OHIF_V2"                                                                                                                                                                                                                                                               
## resourceCounts.displayName   "CT Scan"                                                                                                                                                                                                                                                                   
## resourceCounts.description   "CT Scan"                                                                                                                                                                                                                                                                   
## resourceCounts.resourceType  "PATIENT"                                                                                                                                                                                                                                                                   
## resourceCounts.priority      "1"                                                                                                                                                                                                                                                                         
## resourceCounts.openByDefault "TRUE"                                                                                                                                                                                                                                                                      
## resourceCounts.sampleCount   "592"                                                                                                                                                                                                                                                                       
## resourceCounts.patientCount  "585"                                                                                                                                                                                                                                                                       
## resourceCounts.studyId       "gbm_tcga_pan_can_atlas_2018"                                                                                                                                                                                                                                               
## studyId                      "gbm_tcga_pan_can_atlas_2018"                                                                                                                                                                                                                                               
## cancerTypeId                 "difg"                                                                                                                                                                                                                                                                      
## cancerType.name              "Diffuse Glioma"                                                                                                                                                                                                                                                            
## cancerType.dedicatedColor    "Gray"                                                                                                                                                                                                                                                                      
## cancerType.shortName         "DIFG"                                                                                                                                                                                                                                                                      
## cancerType.parent            "brain"                                                                                                                                                                                                                                                                     
## cancerType.cancerTypeId      "difg"                                                                                                                                                                                                                                                                      
## referenceGenome              "hg19"
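
Note that t() turns the one-row table into a character matrix, which is why every value above, including the sample counts, is displayed between quotes. A minimal sketch on a made-up table (toy_info and its values are illustrative, not real output of get_study_info) shows how to read one value back as a number:

```r
# Toy one-row table standing in for the result of get_study_info() (made-up values)
toy_info <- data.frame(
  name = "Toy study",
  sequencedSampleCount = 397,
  cnaSampleCount = 575,
  stringsAsFactors = FALSE
)

# Transposing a data.frame with mixed column types yields a character matrix
toy_info_t <- t(toy_info)
is.character(toy_info_t)

# Select a value by its row name, then convert it back to a number
n_sequenced <- as.integer(toy_info_t["sequencedSampleCount", 1])
n_sequenced
```

If the counts are needed for calculations, it may be simpler to keep the untransposed table and use t() only for display.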

To view the list of data types (molecular profiles) available in this study:

study_profiles <- available_profiles(study_id) %>% as.data.frame()
study_profiles
##    molecularAlterationType genericAssayType    datatype
## 1            GENERIC_ASSAY     ARMLEVEL_CNA CATEGORICAL
## 2            GENERIC_ASSAY GENETIC_ANCESTRY LIMIT-VALUE
## 3   COPY_NUMBER_ALTERATION             <NA>    DISCRETE
## 4   COPY_NUMBER_ALTERATION             <NA>  LOG2-VALUE
## 5            GENERIC_ASSAY      METHYLATION LIMIT-VALUE
## 6        MUTATION_EXTENDED             <NA>         MAF
## 7          MRNA_EXPRESSION             <NA>  CONTINUOUS
## 8          MRNA_EXPRESSION             <NA>     Z-SCORE
## 9          MRNA_EXPRESSION             <NA>     Z-SCORE
## 10           PROTEIN_LEVEL             <NA>  LOG2-VALUE
## 11           PROTEIN_LEVEL             <NA>     Z-SCORE
## 12      STRUCTURAL_VARIANT             <NA>          SV
##                                                                      name
## 1                              Putative arm-level copy-number from GISTIC
## 2                                                        Genetic Ancestry
## 3                            Putative copy-number alterations from GISTIC
## 4                                                 Log2 copy-number values
## 5                                      Methylation (HM27 and HM450 merge)
## 6                                                               Mutations
## 7   mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)
## 8  mRNA expression z-scores relative to diploid samples (RNA Seq V2 RSEM)
## 9  mRNA expression z-scores relative to all samples (log RNA Seq V2 RSEM)
## 10                                              Protein expression (RPPA)
## 11                                     Protein expression z-scores (RPPA)
## 12                                                    Structural variants
##                                                                                                                                                                                                                                                                                                                                         description
## 1                                                                                                                                                                                                                                                                                                   Putative arm-level copy-number from GISTIC 2.0.
## 2  Genetic ancestries were determined using five different methods as described in Carrot-Zhang et al (2020). These consensus calls were created based on the ancestral population that received the majority of assignments for each patient. The original data is <a href="https://gdc.cancer.gov/about-data/publications/CCG-AIM-2020">here</a>.
## 3                                                                                                                                                                                Putative copy-number from GISTIC 2.0. Values: -2 = homozygous deletion; -1 = hemizygous deletion; 0 = neutral / no change; 1 = gain; 2 = high level amplification.
## 4                                                                                                                                                                                                                                                                                     Log2 copy-number values for each gene (from Affymetrix SNP6).
## 5                                                                                                                                                                                                                                                                               Methylation between-platform (hm27 and hm450) normalization values.
## 6                                                                                                                                                                                                                                                                            Mutation data from whole exome sequencing of 592 Glioblastoma samples.
## 7                                                                                                                                                                                                                                                                             mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)
## 8                                                                                                                                                                                                            mRNA expression z-scores (RNA Seq V2 RSEM) compared to the expression distribution of each gene tumors that are diploid for this gene.
## 9                                                                                                                                                                                                                                Log-transformed mRNA expression z-scores compared to the expression distribution of all samples (RNA Seq V2 RSEM).
## 10                                                                                                                                                                                                                                                                                       Protein expression measured by reverse-phase protein array
## 11                                                                                                                                                                                                                                                                            Protein expression, measured by reverse-phase protein array, Z-scores
## 12                                                                                                                                                                                                                                                                                                                         Structural Variant Data.
##    showProfileInAnalysisTab patientLevel
## 1                      TRUE        FALSE
## 2                      TRUE        FALSE
## 3                      TRUE        FALSE
## 4                     FALSE        FALSE
## 5                      TRUE        FALSE
## 6                      TRUE        FALSE
## 7                     FALSE        FALSE
## 8                      TRUE        FALSE
## 9                      TRUE        FALSE
## 10                    FALSE        FALSE
## 11                     TRUE        FALSE
## 12                     TRUE        FALSE
##                                                       molecularProfileId
## 1                               gbm_tcga_pan_can_atlas_2018_armlevel_cna
## 2                           gbm_tcga_pan_can_atlas_2018_genetic_ancestry
## 3                                     gbm_tcga_pan_can_atlas_2018_gistic
## 4                                    gbm_tcga_pan_can_atlas_2018_log2CNA
## 5               gbm_tcga_pan_can_atlas_2018_methylation_hm27_hm450_merge
## 6                                  gbm_tcga_pan_can_atlas_2018_mutations
## 7                            gbm_tcga_pan_can_atlas_2018_rna_seq_v2_mrna
## 8             gbm_tcga_pan_can_atlas_2018_rna_seq_v2_mrna_median_Zscores
## 9  gbm_tcga_pan_can_atlas_2018_rna_seq_v2_mrna_median_all_sample_Zscores
## 10                                      gbm_tcga_pan_can_atlas_2018_rppa
## 11                              gbm_tcga_pan_can_atlas_2018_rppa_Zscores
## 12                       gbm_tcga_pan_can_atlas_2018_structural_variants
##                        studyId sortOrder
## 1  gbm_tcga_pan_can_atlas_2018      <NA>
## 2  gbm_tcga_pan_can_atlas_2018       ASC
## 3  gbm_tcga_pan_can_atlas_2018      <NA>
## 4  gbm_tcga_pan_can_atlas_2018      <NA>
## 5  gbm_tcga_pan_can_atlas_2018      DESC
## 6  gbm_tcga_pan_can_atlas_2018      <NA>
## 7  gbm_tcga_pan_can_atlas_2018      <NA>
## 8  gbm_tcga_pan_can_atlas_2018      <NA>
## 9  gbm_tcga_pan_can_atlas_2018      <NA>
## 10 gbm_tcga_pan_can_atlas_2018      <NA>
## 11 gbm_tcga_pan_can_atlas_2018      <NA>
## 12 gbm_tcga_pan_can_atlas_2018      <NA>
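
To pick one profile out of such a table, we can filter on the molecularAlterationType column and keep the molecularProfileId. A minimal sketch on a hand-made table that copies three rows of the output above (toy_profiles is illustrative; the same filter should apply to study_profiles):

```r
# Hand-made copy of three rows of the profile table shown above
toy_profiles <- data.frame(
  molecularAlterationType = c("COPY_NUMBER_ALTERATION", "MUTATION_EXTENDED", "MRNA_EXPRESSION"),
  datatype = c("DISCRETE", "MAF", "CONTINUOUS"),
  molecularProfileId = c(
    "gbm_tcga_pan_can_atlas_2018_gistic",
    "gbm_tcga_pan_can_atlas_2018_mutations",
    "gbm_tcga_pan_can_atlas_2018_rna_seq_v2_mrna"
  ),
  stringsAsFactors = FALSE
)

# Keep only the mutation profile and extract its identifier
mutation_profile_id <- subset(
  toy_profiles,
  molecularAlterationType == "MUTATION_EXTENDED"
)$molecularProfileId
mutation_profile_id
```

When several profiles share the same molecularAlterationType (as for MRNA_EXPRESSION above), the datatype or name column can be added to the filter to tell them apart.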

Download data

Now that we have chosen our study, we can download its data.

Get clinical data

#get the list of all available samples
sampleList <- available_samples(study_id = study_id)

#get clinical data from the list of available samples
clinical_data <- get_clinical_by_sample(sample_id = sampleList$sampleId,
                                        study_id  = study_id)
## ! No `clinical_attribute` passed. Defaulting to returning
## all clinical attributes in "gbm_tcga_pan_can_atlas_2018" study
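#reshape from long to wide format to get one row per sample and one column per clinical attribute (pivot_wider() comes from the tidyr package)
clinical_data <- clinical_data %>%
  pivot_wider(names_from = "clinicalAttributeId", values_from = "value") %>%
  as.data.frame()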

#print first lines
head(clinical_data)
##                                              uniqueSampleKey
## 1 VENHQS0wMi0yNDY2LTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## 2 VENHQS0wMi0yNDcwLTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## 3 VENHQS0wMi0yNDgzLTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## 4 VENHQS0wMi0yNDg1LTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## 5 VENHQS0wMi0yNDg2LTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
## 6 VENHQS0wNi0xMDg0LTAxOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA
##                                         uniquePatientKey        sampleId
## 1 VENHQS0wMi0yNDY2OmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-02-2466-01
## 2 VENHQS0wMi0yNDcwOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-02-2470-01
## 3 VENHQS0wMi0yNDgzOmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-02-2483-01
## 4 VENHQS0wMi0yNDg1OmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-02-2485-01
## 5 VENHQS0wMi0yNDg2OmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-02-2486-01
## 6 VENHQS0wNi0xMDg0OmdibV90Y2dhX3Bhbl9jYW5fYXRsYXNfMjAxOA TCGA-06-1084-01
##      patientId                     studyId ANEUPLOIDY_SCORE  CANCER_TYPE
## 1 TCGA-02-2466 gbm_tcga_pan_can_atlas_2018               11 Glioblastoma
## 2 TCGA-02-2470 gbm_tcga_pan_can_atlas_2018                5 Glioblastoma
## 3 TCGA-02-2483 gbm_tcga_pan_can_atlas_2018                4 Glioblastoma
## 4 TCGA-02-2485 gbm_tcga_pan_can_atlas_2018                8 Glioblastoma
## 5 TCGA-02-2486 gbm_tcga_pan_can_atlas_2018                8 Glioblastoma
## 6 TCGA-06-1084 gbm_tcga_pan_can_atlas_2018                7 Glioblastoma
##      CANCER_TYPE_DETAILED FRACTION_GENOME_ALTERED MSI_SCORE_MANTIS
## 1 Glioblastoma Multiforme                  0.3380           0.2855
## 2 Glioblastoma Multiforme                  0.1140           0.2735
## 3 Glioblastoma Multiforme                  0.2253           0.2721
## 4 Glioblastoma Multiforme                  0.1883           0.2728
## 5 Glioblastoma Multiforme                  0.2043           0.2683
## 6 Glioblastoma Multiforme                  0.2901           0.2907
##   MSI_SENSOR_SCORE MUTATION_COUNT ONCOTREE_CODE SAMPLE_TYPE SOMATIC_STATUS
## 1             0.86             99           GBM     Primary        Matched
## 2             0.02             50           GBM     Primary        Matched
## 3              0.3             45           GBM     Primary        Matched
## 4             0.15             54           GBM     Primary        Matched
## 5             0.04             57           GBM     Primary        Matched
## 6              0.3             90           GBM     Primary        Matched
##   TBL_SCORE        TISSUE_SOURCE_SITE TISSUE_SOURCE_SITE_CODE TMB_NONSYNONYMOUS
## 1        93 MD Anderson Cancer Center                       2       3.366666667
## 2        31 MD Anderson Cancer Center                       2               1.7
## 3       102 MD Anderson Cancer Center                       2               1.5
## 4        33 MD Anderson Cancer Center                       2       1.833333333
## 5        75 MD Anderson Cancer Center                       2               1.9
## 6        83       Henry Ford Hospital                       6                 3
##   TUMOR_TISSUE_SITE                               TUMOR_TYPE
## 1             Brain   Glioblastoma Multiforme (GBM), Treated
## 2             Brain   Glioblastoma Multiforme (GBM), Treated
## 3             Brain Glioblastoma Multiforme (GBM), Untreated
## 4             Brain Glioblastoma Multiforme (GBM), Untreated
## 5             Brain Glioblastoma Multiforme (GBM), Untreated
## 6             Brain Glioblastoma Multiforme (GBM), Untreated
##   TISSUE_PROSPECTIVE_COLLECTION_INDICATOR
## 1                                    <NA>
## 2                                    <NA>
## 3                                    <NA>
## 4                                    <NA>
## 5                                    <NA>
## 6                                    <NA>
##   TISSUE_RETROSPECTIVE_COLLECTION_INDICATOR
## 1                                      <NA>
## 2                                      <NA>
## 3                                      <NA>
## 4                                      <NA>
## 5                                      <NA>
## 6                                      <NA>
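
In the output above, numeric attributes such as TMB_NONSYNONYMOUS are printed as "1.7" or "3" rather than with a uniform number of decimals, which suggests that the values are still stored as text after the reshape. The following is a minimal sketch, on a toy table with made-up values (toy_long and toy_wide are hypothetical names, not part of the practical), of how the long-to-wide reshape works and how base R's type.convert() could be used afterwards to let R guess the column types. It assumes the tidyr package is installed.

```r
library(tidyr)

# Toy long-format table mimicking the layout of the clinical data:
# one row per sample and attribute (hypothetical values).
toy_long <- data.frame(
  sampleId = c("S1", "S1", "S2", "S2"),
  clinicalAttributeId = c("MUTATION_COUNT", "CANCER_TYPE",
                          "MUTATION_COUNT", "CANCER_TYPE"),
  value = c("99", "Glioblastoma", "50", "Glioblastoma"),
  stringsAsFactors = FALSE
)

# Wide format: one row per sample, one column per attribute
toy_wide <- as.data.frame(
  pivot_wider(toy_long,
              names_from = "clinicalAttributeId",
              values_from = "value")
)

# All attribute columns are still character here;
# type.convert() guesses a more suitable type for each column
# (as.is = TRUE keeps text columns as character instead of factor).
toy_wide <- type.convert(toy_wide, as.is = TRUE)

# Inspect the resulting column types
str(toy_wide)
```

The same type.convert() call could be applied to clinical_data if you need to compute statistics on columns such as MUTATION_COUNT. Check the result with str(), since a column containing any non-numeric entry stays as text.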

Citation

If you use cBioPortal in your research, don’t forget to cite the following papers:
- Cerami et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery. May 2012 2; 401. PubMed. https://pubmed.ncbi.nlm.nih.gov/22588877/
- Gao et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal. 6, pl1 (2013). PubMed. https://pubmed.ncbi.nlm.nih.gov/23550210/
- de Bruijn et al. Analysis and Visualization of Longitudinal Genomic and Clinical Data from the AACR Project GENIE Biopharma Collaborative in cBioPortal. Cancer Res (2023). PubMed. https://pubmed.ncbi.nlm.nih.gov/37668528/

Remember also to cite the source of the data if you are using a publicly available dataset.